Mini-Project: SVM & LR Classification

Rain in Australia: Predict Rain Tomorrow
Created by An Nguyen, Andy Ho, Jodi Pafford, Tori Wheelis
February 14, 2019

Create Models

The following code covers pre-processing, helper function definitions, logistic regression, and support vector machines.

In [1]:
import pandas as pd
import numpy as np

##Load original dataset.  No longer needed.
##The cleaning process is very computationally expensive, therefore the 'rainfall.csv' file was created for later use. 
#rainfall_original = pd.read_csv('weatherAus.csv') 

##load pre-generated rainfall.csv file.
rainfall = pd.read_csv('rainfall.csv', index_col=0) 
rainfall.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 140787 entries, 0 to 140786
Data columns (total 23 columns):
Date             140787 non-null object
Location         140787 non-null object
MinTemp          140787 non-null float64
MaxTemp          140787 non-null float64
Rainfall         140787 non-null float64
Evaporation      97184 non-null float64
Sunshine         89329 non-null float64
WindGustDir      134862 non-null float64
WindGustSpeed    134862 non-null float64
WindDir9am       140787 non-null float64
WindDir3pm       140787 non-null float64
WindSpeed9am     140787 non-null float64
WindSpeed3pm     140787 non-null float64
Humidity9am      140787 non-null float64
Humidity3pm      140787 non-null float64
Pressure9am      129190 non-null float64
Pressure3pm      129190 non-null float64
Cloud9am         107253 non-null float64
Cloud3pm         107253 non-null float64
Temp9am          140787 non-null float64
Temp3pm          140787 non-null float64
RainToday        140787 non-null int64
RainTomorrow     140787 non-null int64
dtypes: float64(19), int64(2), object(2)
memory usage: 25.8+ MB
In [2]:
#Functions to find the average value using the bracketing values around the NaN's.  
    #For instance if a city's 'MinTemp' has 34, 32, NaN, NaN, 55 recorded 
    #the function will average 32 and 55 for the first NaN: (32+55)/2 = 43.5 
    #and average the above value and 55 for the second NaN: (43.5+55)/2 = 49.25
#Will only use values if they are from the same city.
#If NaN is the earliest timepoint for a given city the next timepoint with no NaN will be given instead of the mean.
#If NaN is the latest timepoint for a given city the previous timepoint with no NaN will be given instead of the mean.

def impute_by_city(cities,variables):
    for c in cities:
        #Parse out observations from a single city.
        temp = rainfall[rainfall.Location == c]
        
        #Iterate through all observations of the temp data frame.
        i = min(temp.index)
        while i <= max(temp.index):
            for v in variables:
                #Check to see if there are values recorded for the variable, will pass over if all are NaN.
                if pd.isna(temp[v]).all():
                    pass
                
                #Check to see if a single value is NaN.
                elif pd.isna(temp[v][i]):
                    #Find the mean of the bracketing values and impute into the main dataframe (use .loc to avoid chained assignment).
                    temp.loc[i, v] = find_mean(temp[v], i)
                    rainfall.loc[i, v] = temp.loc[i, v]
            i = i + 1       

#Find mean of bracketing values.
def find_mean(templist, index):
    #If NaN is earliest timepoint for the city take the next value that is not NaN.
    if index == min(templist.index): 
        return find_top(templist, index)
    
    #If latest timepoint for the city take the previous value that is not NaN.
    elif index == max(templist.index): 
        return find_bottom(templist, index)
    
    else:
        #Find previous non-NaN value.
        bottom = find_bottom(templist, index) 
        #Find next non-NaN value.
        top = find_top(templist, index) 
        
    #If the current value is not from the latest timepoint for the city but there are no more non-NaN values recorded
    #after it, the previous non-NaN value will be taken.
    if pd.isna(top): 
        return bottom
    

    else:
        mean = (top + bottom)/2
        return mean

#Find previous non-NaN value.
def find_bottom(templist, index):
    while pd.isna(templist[index-1]):
        index = index-1
    bottom = templist[index-1]
    return bottom

#Find next non-NaN value.
#If there are no more non-NaN values return the previous non-NaN value.
def find_top(templist, index):
    while pd.isna(templist[index+1]):
        index = index+1
        if index == max(templist.index):
            top = np.nan
            return top
    top = templist[index+1]
    return top   
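As a sanity check on the imputation logic, the bracketing-average rule can be reproduced on the worked example from the comments above (34, 32, NaN, NaN, 55). The sketch below is a standalone re-implementation against a plain RangeIndex, not the notebook's city-filtered frames; `bracket_impute` is a hypothetical name introduced only for this illustration, and it assumes the series has at least one recorded value:

```python
import numpy as np
import pandas as pd

def bracket_impute(s):
    """Fill NaNs left to right: each NaN becomes the mean of the previous
    (possibly just-imputed) value and the next recorded value; at the
    edges the nearest recorded value is copied instead of averaged."""
    s = s.copy()
    for i in range(len(s)):
        if pd.isna(s.iloc[i]):
            prev_vals = s.iloc[:i].dropna()
            next_vals = s.iloc[i + 1:].dropna()
            if prev_vals.empty:           # NaN at the earliest timepoint
                s.iloc[i] = next_vals.iloc[0]
            elif next_vals.empty:         # NaN at the latest timepoint
                s.iloc[i] = prev_vals.iloc[-1]
            else:                         # average the bracketing values
                s.iloc[i] = (prev_vals.iloc[-1] + next_vals.iloc[0]) / 2
    return s

print(bracket_impute(pd.Series([34.0, 32.0, np.nan, np.nan, 55.0])).tolist())
# → [34.0, 32.0, 43.5, 49.25, 55.0]
```

This matches the (32+55)/2 = 43.5 and (43.5+55)/2 = 49.25 example given in the comments of `impute_by_city`.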
In [3]:
##Code for the first-run data cleaning; no longer needed after 'rainfall.csv' was created at the end of the cleaning process.

#rainfall = rainfall_original.copy()

##'RISK_MM' was used by the creator of the dataset to extrapolate the response variable, 'RainTomorrow.'  It needs to be
##dropped so it does not leak into the prediction.
#rainfall.drop(["RISK_MM"], axis=1, inplace=True)

##Drop any observation with no record of rainfall for the day.  Cannot be imputed.
#rainfall.dropna(subset=["RainToday"], inplace=True)

#Reset the Index of each observation to match its iloc and remove gaps between Index integers.
#rainfall = rainfall.reset_index(drop=True)
#rainfall.info()
In [4]:
##can be skipped if rainfall.csv already generated!

##set the cardinal directions to degrees.
#directions = {'N':0, 'NNE':22.5, 'NE':45, 'ENE':67.5, 'E':90, 'ESE':112.5, 'SE':135, 'SSE':157.5, 'S':180,\
#              'SSW':202.5, 'SW':225, 'WSW':247.5, 'W':270, 'WNW':292.5, 'NW':315, 'NNW':337.5}

##Replace cardinal directions with their corresponding degrees.
#rainfall = rainfall.replace(directions) 

#Get name of all cities in the data frame.
#cities = rainfall.Location.unique() 

#c_variables = []
#d_variables = []

##Change 'Yes' and 'No' to 1 and 0 respectively.
#rainfall.RainToday = rainfall.RainToday=='Yes'
#rainfall.RainToday = rainfall.RainToday.astype(int)
#rainfall.RainTomorrow = rainfall.RainTomorrow=='Yes'
#rainfall.RainTomorrow = rainfall.RainTomorrow.astype(int)

##Find all variables with continuous data type.
#for l in list(rainfall):
#    if (rainfall[l].dtypes == 'float64'):
#        c_variables.append(l)
#    else:
#        d_variables.append(l)
In [5]:
##Can be skipped if 'rainfall.csv' was already generated!  Very expensive; 'rainfall.csv' can be loaded from the working directory instead.

##Impute values to NaN's and save to csv file for later use.
#impute_by_city(cities, c_variables)
#rainfall.to_csv("rainfall.csv", sep=',', index=True)
In [6]:
#Variables 'Evaporation' and 'Sunshine' contained many missing values, too many to be imputed.
rainfall = rainfall.drop(['Evaporation', 'Sunshine'], axis = 1)

#Get name of all cities in the data frame.
l = list(rainfall.Location.unique())

#Drop all observations with NaN's.  These are values that could not be imputed using the above code.
rainfall.dropna(subset = list(rainfall), inplace = True)

#List all cities that were dropped
for i in l:
    if i not in rainfall.Location.unique():
        print(i)
        
#'Date' and 'Location' variables not needed for prediction. 
rainfall = rainfall.drop(['Date', 'Location'], axis = 1) 

##Make a copy of the main data frame to be used by Logistic Regression.
lr_rainfall = rainfall.copy()

##Make a copy of the main data frame to be used by Support Vector Machines.
svm_rainfall = rainfall.copy()
BadgerysCreek
Newcastle
NorahHead
Penrith
Tuggeranong
MountGinini
Nhil
Dartmoor
GoldCoast
Adelaide
Albany
Witchcliffe
SalmonGums
Walpole

Logistic Regression

In [7]:
from sklearn.model_selection import ShuffleSplit as ss

#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in lr_rainfall:
    #Response variable is 'RainTomorrow'
    y = lr_rainfall['RainTomorrow'].values
    
    #Remove response variable from dataframe
    del lr_rainfall['RainTomorrow']
    
    #Everything else is the explanatory variables used in prediction.
    x = lr_rainfall.values 
    
#Split our data into training and testing sets: 80% of the data will be in the training set and 20% in the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement.  It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size  = 0.2)
In [8]:
from sklearn import metrics as mt
from sklearn.preprocessing import StandardScaler as sts
from sklearn.linear_model import LogisticRegression as lr
import time

column_names = lr_rainfall.columns
weights = []
weights_array = []

scl_obj = sts()
t0=time.time()

#Split the data into training and testing set 5 different ways, iterate through each way.
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(x,y)):
    
    #Standardize the explanatory variables of the training and testing sets to a mean around 0 and a standard deviation of 1.
    #The training-set mean is subtracted from each value, which is then divided by the training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])

    #Perform Logistic Regression on training set.
    lr_clf = lr(penalty='l2', C=0.05)
    lr_clf.fit(X_train_scaled,y[train_indices])

    #Perform prediction using the scaled explanatory variables of the testing set.
    y_hat = lr_clf.predict(X_test_scaled)
    
    #Find accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices],y_hat)
    conf = mt.confusion_matrix(y[test_indices],y_hat)
    print("")
    print('accuracy:', acc )
    print(conf )
    print ("Time to Run:", time.time()-t0)
    
    #Pair each variable name with the weight computed by the regression model.
    zip_vars = pd.Series(lr_clf.coef_[0].T, index=column_names)
    for name, coef in zip_vars.items():
        #Print out names and weight of each variable
        print(name, 'has weight of', coef)
        
        #Add weights computed by current iteration of training and testing split.
        weights.append(coef)
    
    #Add all the weights of each iteration into one master array.  
    weights_array.append(weights)
    
    #reset weights variable for next iteration.
    weights = []

#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)
accuracy: 0.8563604065923222
[[15207   768]
 [ 2143  2148]]
Time to Run: 0.932466983795166
MinTemp has weight of 0.08491863660392236
MaxTemp has weight of 0.13385552206800927
Rainfall has weight of 0.08164561220496877
WindGustDir has weight of 0.026312358960561454
WindGustSpeed has weight of 0.7016103632185202
WindDir9am has weight of -0.11177871386762439
WindDir3pm has weight of 0.08539932261920914
WindSpeed9am has weight of -0.07817981393854864
WindSpeed3pm has weight of -0.23719910700365127
Humidity9am has weight of 0.08596490446599352
Humidity3pm has weight of 1.1240427979666148
Pressure9am has weight of 1.0561355541349242
Pressure3pm has weight of -1.4700440378379795
Cloud9am has weight of 0.15260703597847308
Cloud3pm has weight of 0.3713360944736593
Temp9am has weight of 0.1690180503249278
Temp3pm has weight of -0.4296788547611217
RainToday has weight of 0.19147983366831536

accuracy: 0.8514753774795224
[[15121   842]
 [ 2168  2135]]
Time to Run: 1.829068660736084
MinTemp has weight of 0.07506891134875192
MaxTemp has weight of 0.12459751438014227
Rainfall has weight of 0.08899832269944588
WindGustDir has weight of 0.03216880229750925
WindGustSpeed has weight of 0.7133230624287965
WindDir9am has weight of -0.10538560737581755
WindDir3pm has weight of 0.08060308187441878
WindSpeed9am has weight of -0.06954503019388948
WindSpeed3pm has weight of -0.2490874709925909
Humidity9am has weight of 0.09091259821665681
Humidity3pm has weight of 1.1460348795310547
Pressure9am has weight of 1.0411555548683722
Pressure3pm has weight of -1.450853435271526
Cloud9am has weight of 0.14748570287684162
Cloud3pm has weight of 0.3827790490545101
Temp9am has weight of 0.1549215249234521
Temp3pm has weight of -0.37869455330297547
RainToday has weight of 0.18166616467353186

accuracy: 0.8538932201717162
[[15176   834]
 [ 2127  2129]]
Time to Run: 2.6518704891204834
MinTemp has weight of 0.047145004168777864
MaxTemp has weight of 0.13383232653641206
Rainfall has weight of 0.08456637265745731
WindGustDir has weight of 0.03673159906697011
WindGustSpeed has weight of 0.7122869130521494
WindDir9am has weight of -0.10394688141341132
WindDir3pm has weight of 0.08504462184253946
WindSpeed9am has weight of -0.06571715563473518
WindSpeed3pm has weight of -0.2574067115562508
Humidity9am has weight of 0.1008710007919899
Humidity3pm has weight of 1.1326168444830733
Pressure9am has weight of 1.0561543544859224
Pressure3pm has weight of -1.4641507468574346
Cloud9am has weight of 0.16140850304160148
Cloud3pm has weight of 0.3677925700471666
Temp9am has weight of 0.20061596212840116
Temp3pm has weight of -0.4095985612752064
RainToday has weight of 0.19512816137828695

accuracy: 0.8546827198263101
[[15124   776]
 [ 2169  2197]]
Time to Run: 3.418818950653076
MinTemp has weight of 0.03429325871120354
MaxTemp has weight of 0.08290633876167175
Rainfall has weight of 0.08557877121266279
WindGustDir has weight of 0.0319642712562779
WindGustSpeed has weight of 0.7118139236133727
WindDir9am has weight of -0.10488046273534643
WindDir3pm has weight of 0.08651681260136873
WindSpeed9am has weight of -0.0651608150732557
WindSpeed3pm has weight of -0.25857652241767753
Humidity9am has weight of 0.09152519753956427
Humidity3pm has weight of 1.1430834405559647
Pressure9am has weight of 1.0735740429687568
Pressure3pm has weight of -1.4845874612669148
Cloud9am has weight of 0.1543609166421011
Cloud3pm has weight of 0.36929439085073
Temp9am has weight of 0.21080985687731266
Temp3pm has weight of -0.3558210840531975
RainToday has weight of 0.19395025336843527

accuracy: 0.853399782887595
[[15086   808]
 [ 2163  2209]]
Time to Run: 4.191751718521118
MinTemp has weight of 0.06316156410484959
MaxTemp has weight of 0.09251153796006142
Rainfall has weight of 0.08759952701499257
WindGustDir has weight of 0.03884931832339645
WindGustSpeed has weight of 0.7254456176790958
WindDir9am has weight of -0.11173778063715209
WindDir3pm has weight of 0.08031716451590577
WindSpeed9am has weight of -0.07811942496325962
WindSpeed3pm has weight of -0.2556817993295474
Humidity9am has weight of 0.09075811424088506
Humidity3pm has weight of 1.1406641960003872
Pressure9am has weight of 1.0535094473409357
Pressure3pm has weight of -1.4731711080960659
Cloud9am has weight of 0.14789210985885298
Cloud3pm has weight of 0.3660911078692227
Temp9am has weight of 0.18027179952714723
Temp3pm has weight of -0.3685734353550839
RainToday has weight of 0.19173558661458212
In [9]:
import plotly as ply

#Run once at the start to enable plotly offline mode.
ply.offline.init_notebook_mode() 

#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)

#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)

#Make one data frame with variable names as the index, a column of mean weights, and a column of standard deviations.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)

#Sort the variables in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])

#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)

#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

ply.offline.iplot(fig)

#Grab coefficients whose mean weight has absolute value greater than the cutoff, 0.5 here; this can be set by the user.
cutoff = 0.5
lr_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        lr_voi.append(index)
In [10]:
from matplotlib import pyplot as plt

#Add response variable back to the explanatory dataset.
lr_rainfall['RainTomorrow'] = y 

#Group observations by the response variable, 1's together and 0's together.
df_grouped = lr_rainfall.groupby(['RainTomorrow'])

#Plot kernel density estimates of the variables whose weights exceed the user-defined cutoff set above, 0.5 in this case.
vars_to_plot = lr_voi

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    plt.subplot(1,1,1)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')

Support Vector Machines

In [11]:
#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in svm_rainfall:
    #Response variable is 'RainTomorrow'
    y = svm_rainfall['RainTomorrow'].values
    
    #Remove response variable from dataframe
    del svm_rainfall['RainTomorrow']
    
    #Everything else is the explanatory variables used in prediction.
    x = svm_rainfall.values 
    
#Split our data into training and testing sets: 80% of the data will be in the training set and 20% in the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement.  It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size  = 0.2)
In [12]:
from sklearn.svm import SVC 

weights = []
weights_array = []
t0=time.time()

#Split the data into training and testing set 5 different ways, iterate through each way.
for train_indices, test_indices in cv_object.split(x,y): 
    
    #Standardize the explanatory variables of the training and testing sets to a mean around 0 and a standard deviation of 1.
    #The training-set mean is subtracted from each value, which is then divided by the training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])

    #Perform Support Vector Machine on training set.
    svm_clf = SVC(C=0.5, kernel='linear', degree=3, gamma='auto')
    svm_clf.fit(X_train_scaled, y[train_indices])

    #Perform prediction using the scaled explanatory variables of the testing set.
    y_hat = svm_clf.predict(X_test_scaled)
    
    #Find accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices],y_hat)
    conf = mt.confusion_matrix(y[test_indices],y_hat)
    print("")
    print('accuracy:', acc )
    print(conf)
    print ("Time to Run:", time.time()-t0)
    
    #Pair each variable name with the weight computed by the SVM model.
    zip_vars = pd.Series(svm_clf.coef_[0],index=column_names)
    for name, coef in zip_vars.items():
        #Print out names and weight of each variable
        print(name, 'has weight of', coef)
        
        #Add weights computed by current iteration of training and testing split.
        weights.append(coef)
    
    #Add all the weights of each iteration into one master array.  
    weights_array.append(weights)
    
    #reset weights variable for next iteration.
    weights = []
    
#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)  
accuracy: 0.8498963781703346
[[15173   730]
 [ 2312  2051]]
Time to Run: 186.1053991317749
MinTemp has weight of -0.03451697097091255
MaxTemp has weight of 0.28083423445389144
Rainfall has weight of 0.10565227237952968
WindGustDir has weight of 0.013387121692403525
WindGustSpeed has weight of 0.4854877080422284
WindDir9am has weight of -0.06991188199003773
WindDir3pm has weight of 0.07373495302847743
WindSpeed9am has weight of -0.027092582524915088
WindSpeed3pm has weight of -0.2231282757546751
Humidity9am has weight of 0.005764672114992209
Humidity3pm has weight of 0.8188663213622931
Pressure9am has weight of 0.7688547226448463
Pressure3pm has weight of -1.0220388533844016
Cloud9am has weight of 0.06747625647926725
Cloud3pm has weight of 0.17360584124503475
Temp9am has weight of 0.04501926779479959
Temp3pm has weight of -0.32834678735844136
RainToday has weight of 0.12882715980163084

accuracy: 0.8518701273068193
[[15194   667]
 [ 2335  2070]]
Time to Run: 382.45932626724243
MinTemp has weight of -0.044149031572942476
MaxTemp has weight of 0.3276422977023685
Rainfall has weight of 0.09696052553886148
WindGustDir has weight of 0.013452777945360594
WindGustSpeed has weight of 0.47257741643034024
WindDir9am has weight of -0.06751297035032167
WindDir3pm has weight of 0.07524290672490963
WindSpeed9am has weight of -0.02843753377521807
WindSpeed3pm has weight of -0.20918015671873036
Humidity9am has weight of -0.026380995811678076
Humidity3pm has weight of 0.8514966065763474
Pressure9am has weight of 0.7478351971551547
Pressure3pm has weight of -0.9938102670553235
Cloud9am has weight of 0.08208981992879671
Cloud3pm has weight of 0.17508086545672086
Temp9am has weight of -0.01273946493739686
Temp3pm has weight of -0.2938811528416636
RainToday has weight of 0.13326533202166502

accuracy: 0.8501430968123952
[[15270   700]
 [ 2337  1959]]
Time to Run: 566.4134185314178
MinTemp has weight of -0.07318165445883551
MaxTemp has weight of 0.28025107445694175
Rainfall has weight of 0.10370811796110502
WindGustDir has weight of 0.020145518112769878
WindGustSpeed has weight of 0.4903674088155867
WindDir9am has weight of -0.06523085526418981
WindDir3pm has weight of 0.06840600824668286
WindSpeed9am has weight of -0.022895319238557477
WindSpeed3pm has weight of -0.22068807767811904
Humidity9am has weight of -0.00991680203787837
Humidity3pm has weight of 0.845439997644462
Pressure9am has weight of 0.7633685608677752
Pressure3pm has weight of -1.0044973527175216
Cloud9am has weight of 0.07487376490325914
Cloud3pm has weight of 0.16745393196265468
Temp9am has weight of 0.0645708613761542
Temp3pm has weight of -0.3030174652262758
RainToday has weight of 0.12898664846966312

accuracy: 0.8543866574558374
[[15272   690]
 [ 2261  2043]]
Time to Run: 752.2365188598633
MinTemp has weight of -0.034102483421520446
MaxTemp has weight of 0.2793730317697509
Rainfall has weight of 0.10129349610076588
WindGustDir has weight of 0.010726982581616085
WindGustSpeed has weight of 0.48109197951907845
WindDir9am has weight of -0.07982374467263753
WindDir3pm has weight of 0.07440312591987208
WindSpeed9am has weight of -0.028985784540395798
WindSpeed3pm has weight of -0.21890784924869422
Humidity9am has weight of -0.005184608175795802
Humidity3pm has weight of 0.8187119504304974
Pressure9am has weight of 0.7575321707759031
Pressure3pm has weight of -1.005785616616663
Cloud9am has weight of 0.07598953022375099
Cloud3pm has weight of 0.16613901033383627
Temp9am has weight of 0.04287037933671911
Temp3pm has weight of -0.3247902592977425
RainToday has weight of 0.1292241963581091

accuracy: 0.8522648771341162
[[15245   676]
 [ 2318  2027]]
Time to Run: 1098.012132883072
MinTemp has weight of -0.0401162598668634
MaxTemp has weight of 0.26950148823982545
Rainfall has weight of 0.1076246297260468
WindGustDir has weight of 0.014353856503760198
WindGustSpeed has weight of 0.479131611642174
WindDir9am has weight of -0.07518263335506958
WindDir3pm has weight of 0.06867452669126806
WindSpeed9am has weight of -0.030974111063756027
WindSpeed3pm has weight of -0.21009561121093157
Humidity9am has weight of -0.009739827176645122
Humidity3pm has weight of 0.829519327406615
Pressure9am has weight of 0.752872722455777
Pressure3pm has weight of -0.9946166164528449
Cloud9am has weight of 0.07576475149176076
Cloud3pm has weight of 0.1692737929984105
Temp9am has weight of 0.036034807599278906
Temp3pm has weight of -0.2977600614644871
RainToday has weight of 0.12831372046298384
In [13]:
#look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )
(28244, 18)
(28244,)
[14125 14119]
In [14]:
#Run once at the start to enable plotly offline mode.
ply.offline.init_notebook_mode() 

#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)

#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)

#Make one data frame with variable names as the index, a column of mean weights, and a column of standard deviations.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)

#Sort the variables in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])

#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)

#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Support Vector Machines Weights, with error bars'}

ply.offline.iplot(fig)

#Grab coefficients whose mean weight has absolute value greater than the cutoff, 0.5 here; this can be set by the user.
cutoff = 0.5
svm_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        svm_voi.append(index)
In [15]:
#Make a dataframe of the training data from the last iteration above.
df_tested_on = svm_rainfall.iloc[train_indices]

#Get the support vectors from the trained model; copy so the later column assignment does not trigger SettingWithCopyWarning.
df_support = df_tested_on.iloc[svm_clf.support_,:].copy()

#Add response variable back to the explanatory datasets.  Note: 'support_' indexes into the training set,
#so the matching labels must be taken from y[train_indices].
df_support['RainTomorrow'] = y[train_indices][svm_clf.support_]
svm_rainfall['RainTomorrow'] = y

#Group observations by the response variable, 1's together and 0's together.
df_grouped_support = df_support.groupby(['RainTomorrow'])
df_grouped = svm_rainfall.groupby(['RainTomorrow'])

#Plot kernel density estimates of the variables whose weights exceed the user-defined cutoff set above, 0.5 in this case.
vars_to_plot = svm_voi

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    #Plot support vector stats.
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    
    #Plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')

Model Advantages

Both Logistic Regression (LR) and Support Vector Machines (SVM) are algorithms used for classification problems, and LR and SVM with a linear kernel generally perform comparably in practice. Still, there are two differences to note:

  • Logistic loss diverges faster than hinge loss, so in general it will be more sensitive to outliers.
  • Logistic loss does not go to zero even when a point is classified sufficiently confidently, which can lead to a minor degradation in accuracy.
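The second point can be illustrated numerically. This small sketch (not code from this notebook) evaluates both losses as a function of the margin m = y·f(x): hinge loss max(0, 1 − m) is exactly zero once m ≥ 1, while logistic loss log(1 + e^(−m)) stays positive for every finite margin:

```python
import numpy as np

# Margin m = y * f(x): m >= 1 means confidently correct, m < 0 means misclassified.
margins = np.array([-2.0, 0.0, 1.0, 4.0])

hinge    = np.maximum(0.0, 1.0 - margins)   # SVM hinge loss
logistic = np.log1p(np.exp(-margins))       # logistic (log) loss, natural log

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin {m:+.1f}:  hinge={h:.4f}  logistic={l:.4f}")
```

At m = 4.0 the hinge loss is exactly 0 while the logistic loss is still positive (about 0.018), so even confidently classified points keep pulling on the logistic solution.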

For the purposes of the "Predict Rain Tomorrow in Australia" dataset, the accuracy of both LR and SVM hovered around 85%; neither showed superior performance. This is probably because our pre-processing removed outliers from the data set. We ran each algorithm 5 times to check for any significant deviation; both proved to have essentially the same accuracy. What was noticeable was the amount of time required to run LR versus SVM. We included a timer to measure it, with the following results:

  • Logistic Regression (seconds)

    • 1st: 0.94
    • 2nd: 1.85
    • 3rd: 2.89
    • 4th: 3.75
    • 5th: 4.68
  • Support Vector Machine (seconds)

    • 1st: 405
    • 2nd: 856
    • 3rd: 1259
    • 4th: 1706
    • 5th: 2174

This was a large dataset, and SVM took roughly 464 times longer than LR to complete all 5 runs. Since the two performed nearly identically in accuracy, we conclude that LR is more appropriate for this dataset in terms of the time and resources required.

Interpret Feature Importance

Weights depict the association between each predictor and what we are trying to predict: whether it will rain in Australia tomorrow. In LR and SVM it is tempting to conclude that variables with larger weights/coefficients are more important because they correspond to larger changes in the predicted outcome. Before reaching that conclusion, however, it is important to check whether all the predictors are in the same units, which was not the case for this data set; comparing raw coefficients would be comparing apples to oranges. So, before moving forward, it was important to standardize all the predictors. Here are the graphical results for LR and SVM:
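Concretely, the standardization used in the modeling cells is the z-score transform. A minimal sketch with made-up pressure and humidity values (illustrative only, not taken from the dataset) shows how it puts the columns on a common scale so their weights become comparable:

```python
import numpy as np

# Illustrative values only: pressure (hPa) and humidity (%) live on very
# different scales, so their raw regression weights are not comparable.
X = np.array([[1010.0, 30.0],
              [1015.0, 55.0],
              [1020.0, 95.0]])

# z-score transform, the same operation sklearn's StandardScaler performs:
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # each column now centred at ~0
print(X_scaled.std(axis=0))   # each column now has standard deviation 1
```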

Judging from the two charts, LR and SVM had identical predictors in their top 3 and bottom 3:

  • Top 3
    • Humidity3pm
    • Pressure9am
    • WindGustSpeed
  • Bottom 3
    • WindSpeed3pm
    • Temp3pm
    • Pressure3pm

The remaining predictors in between had different weights and thus ranked in a different order. We will use RainToday (#4 in LR), MaxTemp (#4 in SVM), MinTemp (#5 in LR), and Rainfall (#5 in SVM) to explain why some variables are more important in LR than in SVM.

STILL NEED TO PROVIDE THE DIFFERENCES. WILL GET TO LATER THIS EVENING - ANDY

Interpret Support Vectors

DO NOT KNOW WHAT STOCHASTIC GRADIENT DESCENT IS. WILL NEED TO DISCUSS MORE ON FRIDAY EVENING CALL - ANDY
In [ ]: